Supervised Latent Semantic Indexing Using Adaptive Sprinkling

نویسندگان

  • Sutanu Chakraborti
  • Rahman Mukras
  • Robert Lothian
  • Nirmalie Wiratunga
  • Stuart N. K. Watt
  • David J. Harper
چکیده

Latent Semantic Indexing (LSI) has been shown to be effective in recovering from synonymy and polysemy in text retrieval applications. However, since LSI ignores class labels of training documents, LSI generated representations are not as effective in classification tasks. To address this limitation, a process called ‘sprinkling’ is presented. Sprinkling is a simple extension of LSI based on augmenting the set of features using additional terms that encode class knowledge. However, a limitation of sprinkling is that it treats all classes (and classifiers) in the same way. To overcome this, we propose a more principled extension called Adaptive Sprinkling (AS). AS leverages confusion matrices to emphasise the differences between those classes which are hard to separate. The method is tested on diverse classification tasks, including those where classes share ordinal or hierarchical relationships. These experiments reveal that AS can significantly enhance the performance of instance-based techniques (kNN) to make them competitive with the state-of-the-art SVM classifier. The revised representations generated by AS also have a favourable impact on SVM performance.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge

In text classification, one key problem is its inherent dichotomy of polysemy and synonym; the other problem is the insufficient usage of abundant useful, but unlabeled text documents. Targeting on solving these problems, we incorporate a sprinkling Latent Semantic Indexing (LSI) with background knowledge for text classification. The motivation comes from: 1) LSI is a popular technique for info...

متن کامل

Supervised Locality Preserving Indexing for Text Categorization

A major characteristic of text categorization problems is the prohibitive high dimensionality of the feature space. Most discrimination methods can not work in such a condition, Latent Semantic Indexing (LSI) has been adopted to solve this problem. However, LSI is not an optimal representation for text categorization task mainly because of two reasons: first, the discriminative categorical info...

متن کامل

Role of semantic indexing for text classification

The Vector Space Model (VSM) of text representation suffers a number of limitations for text classification. Firstly, the VSM is based on the Bag-Of-Words (BOW) assumption where terms from the indexing vocabulary are treated independently of one another. However, the expressiveness of natural language means that lexically different terms often have related or even identical meanings. Thus, fail...

متن کامل

Supervised Semantic Indexing for Ranking Documents

Ranking text documents given a query is one of the key tasks in information retrieval. Typical solutions include classical vector space models using weighted word counts and the cosine similarity (TFIDF) with no machine learning at all, or Latent Semantic Indexing (LSI) using unsupervised learning to learn a low dimensional space of “latent concepts” via a reconstruction objective. The former a...

متن کامل

Efficient semantic indexing via neural networks with dynamic supervised feedback

We describe a portable system for e cient semantic indexing of documents via neural networks with dynamic supervised feedback. We initially represent each document as a modified TF-IDF sparse vector and then apply a learned mapping to a compact embedding space. This mapping is produced by a shallow neural network which learns a latent representation for the textual graph linking words to nearby...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007